Skip to contents

This vignette describes how to go from variants (and optionally gene fusions) to the Minigene Library report.

We will usually work with standard processing pipelines for variant calling, RNA-seq quantification, and RNA fusion calling. The best-known collection of these workflows is NF-core.

It provides the following pipelines for these tasks:

  • sarek for variant calling using HaploytypeCaller and Mutect2
  • rnaseq for RNA-seq quantification using STAR
  • rnafusion for RNA fusion quantification using multiple algorithms

Running these will usually require access to high performance computing facilities, and you already get the processed files from your core facility or sequencing provider.

library(pepitope)

# SNPs and small indels, e.g. from 'sarek' nf-core pipeline
variant_vcf_file = "my_variants.vcf.gz"

# RNA counts and TPM file, e.g. from 'rnaseq' nf-core pipeline
rna_counts_file = "count_file.tsv"
rna_tpm_file = "tpm_file.tsv"

# Combined fusion VCF file, e.g. from 'rnafusion' nf-core pipeline
fusion_vcf_file = "my_fusions.vcf.gz"

Prerequisites

Selecting the right reference genome

The variant calling and RNA-seq counts were mapped to a reference genome and gene annotations. It is important to keep these consistent between the NF-core processing pipelines and the Minigene Library annotation.

We are usually working with BSgenome (for the reference genome) and EnsDb (for the gene annotations) objects. For human data, the most widely used reference genome is GRCh38, and a recent Ensembl annotation release.

With our test data, we know that GRCh38 is the correct reference genome and Ensembl 106 is the correct version of gene annotations. We can get both objects from the BSgenome and AnnotationHub Bioconductor packages, respectively.

ens106 = AnnotationHub::AnnotationHub()[["AH100643"]]
#> loading from cache
asm = BSgenome.Hsapiens.NCBI.GRCh38::BSgenome.Hsapiens.NCBI.GRCh38

A caveat here is that the chromosome prefixes need to be consistent between the variants in the VCF file and the genome/gene annotations. There are two “styles”, either UCSC (includes “chr” prefix) or NCBI/Ensembl (without “chr” prefix).

The sarek pipline uses UCSC prefixes on the GRCh38 genome, so we need to switch the genome and gene annotation styles:

seqlevelsStyle(ens106) = "UCSC"
seqlevelsStyle(asm) = "UCSC"
asm@seqinfo@genome[] = "GRCh38"

The correct styles will depend on how your VCF files were generated.

Adding RNA expression

The NF-core rnaseq workflow will provide two gene expression files, one for raw read counts and one for transcripts per million (TPM). These contain all samples in a run, so we need to subset them to the current sample we are interested in. These files are usually called:

  • salmon.merged.gene_counts.tsv
  • salmon.merged.gene_tpm.tsv

We can combine and subset them the following way:

counts = readr::read_tsv(rna_counts_file) |>
    dplyr::select(gene_id, gene_name, count=SAMPLE)
tpm = readr::read_tsv(rna_tpm_file) |>
    dplyr::select(gene_id, gene_name, tpm=SAMPLE)

rna_sample = inner_join(counts, tpm)

__ SHOW TABLE HERE __

SNPs and small indels

Reading and filtering mutations

vr1 = readVcfAsVRanges(variant_vcf_file) |>
    filter_variants(min_cov=2, min_af=0.05, pass=TRUE)

Here, we are using the following filters:

  • min_cov=2 – a variant needs to be covered by at least 2 reads
  • min_af=0.05 – a variant needs to occur in at leat 5% of reads
  • pass=TRUE – a variant needs to pass the standard QC filters

Annotating and subsetting expressed variants

ann = annotate_coding(vr1, ens106, asm)
subs = ann |>
#    filter_expressed(rna_sample, min_reads=1, min_tpm=0) |>
    subset_context(15)

Here, we are using the following filters:

  • min_reads=1 – the gene needs to have at least one RNA read
  • min_tpm=0 – we do not apply an additional TPM filter

In addition, we set the region of interest (context) to 15 codons up- and downstream of the variant. Hence, a SNP will have a total length of 93 nucleotides (15*3 + the SNP codon itself). An insertion will have the inserted sequence and 15 codons, a deletion only 15 codons both sides. A frameshift will have 15 codons upstream and the entire sequence downstream until a STOP codon is reached. The latter may extend into the 3’ UTR.

#> Warning in instance$preRenderHook(instance): It seems your data is too big for
#> client-side DataTables. You may consider server-side processing:
#> https://rstudio.github.io/DT/server.html

Fusion genes from RNA-seq

Annotating fusion genes

vr2 = readVcfAsVRanges(fusion_vcf_file) |>
    filter_fusions(min_reads=2, min_split_reads=1, min_tools=1)

seqlevelsStyle(vr2) = "UCSC"

Here, we are using the following filters:

  • min_reads=2 – the fusion needs to be supported by 2 split or pair distance reads
  • min_split_reads=1 – the fusion needs to be supported by at least one split read
  • min_tools=1 – the fusion needs to be reported by at least one tool

__ SHOW TABLE HERE __

Subsetting expressed fusion genes

fus = annotate_fusions(vr2, ens106, asm) |>
    filter_expressed(rna_sample, min_reads=1, min_tpm=0) |>
    subset_context(15)

Here, we are using the following filters:

  • min_reads=1 – the gene needs to have at least one RNA read
  • min_tpm=0 – we do not apply an additional TPM filter

In addition, we set the region of interest to 15 codons up- and downstream of the fusion site.

__ SHOW TABLE HERE __

Generating the Minigene Library

Tiling cDNAs of interest into smaller peptides

tiled = pep_tile(subs) |>
    remove_cutsite(BbsI="GAAGAC")

Saving a report file

We can then combine our generated tables into a report save it with the writexl package:

report = make_report(ann, subs, fus, tiled)
writexl::write_xlsx(report, "report_file.xlsx")

This .xlsx file will contain the different tables as sheets. We will use it as an annotation file in the quality control and screen steps.